FOR  XAST  SKILL  SCORE  TEST 


1.  Background: 

a. For the last ten years, Air Weather Service has used the following skill
score for the Terminal Forecast Verification (TAFVER) Program: SS = (S-P)/(100%-P).
In this formula, S = Station (Det/Sq/Wing) percent correct, and P = Persistence
percent correct. This skill score is simple and straightforward, but also has
its limitations.

(1) Since the AWS Skill Score (SS) is sensitive to the quantity P as well
as the difference (S-P), climatology has a big effect on a station's score. For
example, if S beats P by 5% with P = 85%, then a skill score of .33 results.
However, if P = 60%, the skill score would be .125.

(2) The AWS SS rewards forecast hits equally regardless of category or
difficulty. Present TAFVER contingency tables show that AWS forecasters predict
the low categories less frequently than they occur.
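
The sensitivity to P described in (1) can be checked numerically. The following is a minimal sketch; the function name `aws_skill_score` and the sample station percentages are illustrative and are not taken from the TAFVER program.

```python
def aws_skill_score(station_pct, persistence_pct):
    """AWS skill score SS = (S - P) / (100 - P), with S and P in percent."""
    if persistence_pct >= 100.0:
        raise ValueError("score is undefined when persistence is 100% correct")
    return (station_pct - persistence_pct) / (100.0 - persistence_pct)


# The same 5-point margin over persistence yields very different scores.
high_p = aws_skill_score(90.0, 85.0)  # 5 / 15
low_p = aws_skill_score(65.0, 60.0)   # 5 / 40
print(round(high_p, 3), round(low_p, 3))
```

A station in a climate where persistence is right 85% of the time is credited with nearly three times the skill of a station with the same margin where persistence is right only 60% of the time.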

b. One of Dr Robert Miller's first projects, when he became AWS Chief
Scientist in Sep 76, was to review TAFVER procedures. He noted that performance
in the low categories (below 200/1/2, and 200/1/2 to 1000/2) needed improvement.
He attributed this deficiency to the fact that since Persistence is the AWS SS
baseline, forecasters tend to wait until they have a good chance of beating
Persistence before they go against it (a "tie" with Persistence is better than a
"loss"). Consequently, this verification system results in a reluctance to
forecast the low categories. To correct this problem, a system is needed that
encourages forecasters to forecast low categories as often as they occur. One
approach would be a system that gives more credit for hitting the
climatologically rare categories. To this end, Dr Miller led discussions which
resulted in the proposal of a test of the Gringorten Score. The test was
designed so that, in addition to the Gringorten Score, the Log Score, developed
by McDonald of NWS, could also be examined. After considering test costs,
the AWS commander approved the test plan.

c.  As  the  test  plan  was  being  developed,  it  was  realized  that  with  no  extra 
cost  or  effort  the  test  results  could  be  used  to  determine  our  capability  to 
produce  skillful  and  reliable  forecasts  in  probabilistic  terms  (probability 
forecasts).  Consequently,  this  objective  was  also  added  to  the  test. 

2. Test Objectives:

a. To determine which skill score, if any, should replace the present AWS SS.

b. To assess the participating units' capability to prepare reliable and
skillful probability forecasts.

c. To compare subjective probability forecasts prepared by AWS forecasters
with objective probability forecasts prepared by the National Weather Service
Techniques Development Laboratory (NWS/TDL).

3. Test Schedule:

Sep 77                  AWS/DN presented probability forecasting seminar
                        to participating units.

1 Oct 77 - 31 Mar 78    Test conducted.

Jun 78                  Analyze and present results.

4.  Participating  Units:  Forecasting  units  in  CONUS  Regions  43  and  47  participated
in  the  test  and  are  listed  in  Attachment  1.  On  1  Dec  77,  seven  more  units  were 
added.  These  units  did  not  receive  the  AWS/DN  probability  seminar;  the  objective 
was  to  see  if  these  units  were  able  to  prepare  probability  forecasts  as  well  as 
the  23  units  that  received  the  seminar. 

5.  Forecast  Unit  Tasks: 
a. Field Units:

(1) Prepared ceiling and visibility probability forecasts at each regularly
scheduled forecast time (02Z, 08Z, 14Z, and 20Z), for each ceiling category
(<200 ft, ≥200 ft to <1000 ft, ≥1000 ft to <3000 ft, ≥3000 ft) and each
visibility category (<1/2 mi, ≥1/2 mi to <2 mi, ≥2 mi to <3 mi, ≥3 mi) for the
3- and 6-hour verifying times. The probability for a particular category could
range from 0.00 to 1.00 in increments of 0.01, and the four category
probabilities had to sum to one for ceiling and, separately, for visibility.

(2) Sent completed test forms (see Attachment 2) to AFGWC twice a month.

b.  AFGWC: 

(1)  At  each  regularly  scheduled  forecast  time  (02Z,  08Z,  14Z,  and  20Z) , 
subjectively  assigned  probabilities  to  each  ceiling  category  and  each  visibility 
category  for  the  12-  and  24-hour  verifying  times.  Categories  and  probability 
value  instructions  were  the  same  as  for  field  units. 

(2) Computed Brier Score and reliability and sharpness diagrams for
each unit and forecast length for: ceiling, visibility, and ceiling/visibility
combined forecasts (the combined probabilities were calculated as in Atch 3);
conditional climatology forecast; sample climatology forecast; and persistence
forecast. Sent monthly verification feedback to each unit.

(3) Using the probability forecasts and weighting matrices, derived
categorical forecasts that would maximize each of the test skill scores. For
example, to maximize the AWS SS, the category with the highest probability was
selected as the forecast. These were then known as categorical forecasts by
"PROB." Categorical forecasts by "GRING" maximized the Gringorten skill score
and were determined by multiplying the categorical probabilities by the inverse
of the long-term climatology probability for the same category. The highest
product was the categorical forecast. Forecasts by "LOG" were determined by
multiplying the same probabilities by a matrix that tailored the Log Score to
the AWS categories. Then, these three sets of categorical forecasts, each
chosen to maximize a skill score, were verified using each of the three skill
scores and percent correct. Results were computed monthly and sent to AWS/DOA
for analysis.

(4) Verified NWS/TDL model output statistics (MOS) 12- and 24-hour
forecasts for the test units. AFGWC's liaison staff at TDL provided tapes of MOS
ceiling and visibility forecasts precise to two digits.

6. Test Results:
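
The selection rules in 5b(3) can be sketched as follows. The PROB and GRING rules follow the text directly. The LOG rule shown here (choose the category with the lowest expected penalty) is an assumption about how the probabilities were combined with the matrix, and the probabilities, climatology, and penalty values are hypothetical illustration data, not test data.

```python
def forecast_by_prob(probs):
    """PROB: choose the category with the highest forecast probability."""
    return max(range(len(probs)), key=lambda i: probs[i])


def forecast_by_gring(probs, climatology):
    """GRING: multiply each probability by 1/climatological frequency
    and choose the category with the highest product."""
    weighted = [p / c for p, c in zip(probs, climatology)]
    return max(range(len(weighted)), key=lambda i: weighted[i])


def forecast_by_log(probs, penalty):
    """LOG (assumed rule): choose the forecast category j with the lowest
    expected penalty, where penalty[i][j] applies when category i is
    observed and category j is forecast."""
    n = len(probs)
    expected = [sum(probs[i] * penalty[i][j] for i in range(n)) for j in range(n)]
    return min(range(n), key=lambda j: expected[j])


CATEGORIES = "ABCD"
probs = [0.05, 0.15, 0.20, 0.60]        # hypothetical forecast for A, B, C, D
climatology = [0.02, 0.08, 0.10, 0.80]  # hypothetical long-term frequencies
toy_penalty = [                          # hypothetical penalty matrix
    [0, 1, 2, 3],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [3, 2, 1, 0],
]

by_prob = CATEGORIES[forecast_by_prob(probs)]
by_gring = CATEGORIES[forecast_by_gring(probs, climatology)]
by_log = CATEGORIES[forecast_by_log(probs, toy_penalty)]
print(by_prob, by_gring, by_log)
```

With these illustrative numbers, the inverse-climatology weighting moves the categorical forecast from the most likely category to the rarest one, even though that category was assigned only a 5% probability.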

a. Attachment 3 defines each of the skill scores used and Attachment 4
summarizes verification of the three categorical forecasts by each skill score.
Findings are:

(1) Forecasts by PROB were best for percent correct and AWS Skill Score
(which is based on percent correct). Forecasts by LOG were best for Log Score.
Forecasts by GRING did not always score best for Gringorten Score.

(2) For all skill measures, PROB and LOG were nearly the same.

(3) All skill measures showed forecast skill deteriorates with increasing
forecast length.

(4)  MOS  represents  verification  of  the  category  with  the  highest 
probability  from  MOS  12-  and  24-hour  probability  forecasts.  Thus,  MOS  is 
analogous  to  PROB  (12-  and  24-hour  forecasts  from  AFGWC) ,  and  scored  better  than 
PROB  for  all  measures.  Later  results  will  show  that  MOS  scored  better  than  AFGWC 
in  the  Brier  Score  also. 

b. Attachment 5 is the six month summary of Brier Scores for 3- and 6-hour
forecasts by the 23 field units who participated in the entire test, for 12- and
24-hour forecasts of 22 stations by AFGWC and TDL MOS, and for two "controls":
conditional climatology and sample climatology. Approximately 12,000 forecasts
were made for each verifying hour. This summary shows that the field units beat
both conditional and sample climatology for all categories and AFGWC beat them for
the 12- and 24-hour combined CIG/VSBY forecasts as well as the 12- and 24-hour
CIG forecasts. Also, MOS beat both AFGWC and climatology for all 12- and
24-hour forecasts.

c. Attachment 6 shows the percent improvement of the average of the station
forecast Brier Scores over the average of the station conditional climatology
Brier Scores for each month of the test. Also shown is the percentage of
stations whose Brier Score beat the conditional climatology Brier Score in each
month. Field units started high at 3 hours and maintained this level;
performance at 6 hours was more erratic. AFGWC showed little skill with respect
to conditional climatology in October but improved rapidly after receiving
verification feedback.
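
The report does not restate the Brier Score. The sketch below assumes the standard multi-category form (the mean, over all forecasts, of the summed squared differences between the forecast probabilities and the 0/1 outcomes) and assumes that percent improvement means (reference score - forecast score) / reference score x 100. Both are assumptions about how the attachment values were computed, and the sample forecasts are hypothetical.

```python
def brier_score(forecasts, observed):
    """Mean over forecasts of sum_k (p_k - o_k)^2; lower is better.

    forecasts: list of per-category probability lists
    observed:  list of observed category indices, one per forecast
    """
    total = 0.0
    for probs, obs in zip(forecasts, observed):
        total += sum(
            (p - (1.0 if k == obs else 0.0)) ** 2 for k, p in enumerate(probs)
        )
    return total / len(forecasts)


def percent_improvement(bs_forecast, bs_reference):
    """Percent by which the forecast score improves on a reference score."""
    return 100.0 * (bs_reference - bs_forecast) / bs_reference


# Hypothetical sample: four forecasts for categories A, B, C, D (indices 0-3).
observed = [3, 3, 1, 3]
unit = [
    [0.0, 0.1, 0.1, 0.8],
    [0.0, 0.0, 0.2, 0.8],
    [0.1, 0.6, 0.2, 0.1],
    [0.0, 0.1, 0.2, 0.7],
]
climatology = [[0.02, 0.08, 0.10, 0.80]] * 4  # same climatological forecast each time

bs_unit = brier_score(unit, observed)
bs_clim = brier_score(climatology, observed)
print(round(bs_unit, 4), round(bs_clim, 4))
print(round(percent_improvement(bs_unit, bs_clim), 1))
```

Because a lower Brier Score is better, a positive percent improvement means the unit beat the reference forecast.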

d.  Attachment  7  shows  the  month-to-month  percent  improvement  over 
conditional  climatology  for  3-  and  6-hour  ceiling  and  visibility  forecasts. 
Attachment  8  shows  the  same  data  for  12-  and  24-hour  forecasts. 

e. Attachment 9 shows a comparison of subjective (AFGWC) and objective
(MOS) probability forecasts. MOS performance was relatively consistent
throughout the period; as noted earlier, AFGWC showed improvement in the first
four months. However, for Jan-Mar, MOS still maintained an edge over AFGWC
forecast skill.

f. Attachments 10-18 are probability forecast, reliability, and
distribution plots of forecasts by the 23 field units, AFGWC, and MOS for
category D (≥ 3000 feet) ceilings and category D (≥ 3 miles) visibilities, and
the 23 field unit forecasts for category A, B, and C ceilings. The MOS
results cover the entire six month test period and are thus not directly
comparable to the field unit and AFGWC results shown in the attachments, which
only cover the final two months of the test. The forecasts were grouped into
11 probability intervals (0-5%, 5-15%, 15-25%, ..., 85-95%, 95-100%) and the
results plotted at the midpoints of the intervals. The dashed line on the
reliability plots shows the locus of perfectly reliable forecasts and the
points connected by the solid line show the actual reliability results. The
fraction of the forecasts falling within each probability interval is indicated
by the length of the horizontal lines in the distribution plots. Note that
different horizontal scales are used in the distribution plots. The short
vertical lines on the forecast distribution plots indicate a modeled
distribution which assumes the forecasts are perfectly reliable and the
correlation between forecast probabilities and observations is given by
R = 0.98^t, where t is in hours (for a 12-hour forecast R = 0.785). The total
number of forecasts in each sample, the fraction of the time the category
occurred, and the overall forecast bias are also shown. The bias was calculated by:


Bias  =  [expression illegible in source; the summation index begins at i=1]


Where O is the total number of times the category was observed, Ni is the number
of forecasts in the ith probability interval, and Pi is the mean probability for
the ith probability interval (0.025, 0.1, 0.2, ..., 0.9, or 0.975).
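
The grouping into 11 probability intervals and the resulting reliability values can be sketched as below. The function names and sample data are hypothetical. Assigning a forecast that falls exactly on an interval boundary (for example 5%) to the higher interval is an assumption, since the report does not say how boundaries were handled.

```python
import bisect

# Upper edges of the first ten intervals: 0-5%, 5-15%, ..., 85-95%; the last is 95-100%.
EDGES = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
MIDPOINTS = [0.025, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.975]


def interval_index(p):
    """Index (0-10) of the probability interval containing p.
    Boundary values go to the higher interval (assumed convention)."""
    return bisect.bisect_right(EDGES, p)


def reliability_table(probs, occurred):
    """For each interval, return (midpoint, number of forecasts, observed
    relative frequency), with None where no forecasts fell in the interval."""
    counts = [0] * 11
    hits = [0] * 11
    for p, o in zip(probs, occurred):
        i = interval_index(p)
        counts[i] += 1
        if o:
            hits[i] += 1
    return [
        (MIDPOINTS[i], counts[i], hits[i] / counts[i] if counts[i] else None)
        for i in range(11)
    ]


# Hypothetical forecasts of one category and whether it verified.
probs = [0.02, 0.03, 0.90, 0.97, 0.60]
occurred = [False, False, True, True, False]
table = reliability_table(probs, occurred)
for row in table:
    print(row)

# Model correlation quoted in the text for a 12-hour forecast.
r_12 = 0.98 ** 12
print(round(r_12, 3))
```

Perfectly reliable forecasts would show an observed frequency equal to the interval midpoint in every populated interval, which is the dashed line on the reliability plots.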

(1) The 3- and 6-hour category D ceiling results in Atch 10 show
generally good reliability and excellent sharpness. As indicated by the
reliability plots and the bias, there was a tendency to underforecast (assign too
low a probability to) the occurrence of D ceilings for both forecast periods.
The probabilities forecast most frequently (0-5%, 85-95%, and 95-100%) were very
reliable. The nearly 20% reliability error at 60% probability in the 3-hour
forecasts was based on less than 2% of the total forecasts. The AFGWC category D
ceiling results shown in Atch 11 are outstanding. Some breakdown from perfect
reliability occurs for the less frequently used low probabilities. MOS tended to
underforecast category D ceilings (Atch 12). For probabilities above 50%, the
AFGWC forecasts were more reliable than those for MOS, while the opposite was
true below 50%. (Remember that different periods of record are plotted for the
AFGWC and MOS results.) Somewhat of a surprise was the size of the negative
bias for these objective MOS forecasts.


(2) The category D visibility results (Atch 13-15) are poorer than
those for ceilings. The field units generally underforecasted this event,
MOS overforecasted it, and AFGWC forecasts definitely exhibited the characteristic
of overconfidence (overforecasting at high probabilities and underforecasting at
low probabilities, an attempt to forecast with greater sharpness than warranted
by forecast skill). The broken line for the 95-100% interval in the distribution
plots in Atch 13 and 14 indicates the extension of the line beyond the end of
the horizontal scale to the value shown at the tip of the arrow. The value in
parenthesis is that for the model distribution. The erratic reliability results
for AFGWC at probabilities below 55% were based on less than 5% of the total
forecasts. The AFGWC forecasts were too sharp, especially at 24 hours. The MOS
category D visibility forecasts were very reliable, had little overall bias, and
showed a good match to the model distributions. The large reliability error in
the 5-15% interval for 12-hour D visibility forecasts was the result of three
occurrences of category D out of four forecasts. This error is most likely
due to sampling effects and some basic instability in the MOS equations at the
less frequently used low probabilities; i.e., insufficient low visibility cases
available for equation development. This reliability error occurred for just
four forecasts out of 10,838. The MOS probability distributions for D
visibility (Atch 15) [remainder of sentence illegible in source]. This is an
indication of basically lower skill in predicting visibility, which is also seen
in the Brier Scores and other verification results. This effect of lower skill
is not easily detectable in the field unit and AFGWC distributions because of
overriding reliability problems.

(3) The field unit results for category A, B, and C ceiling forecasts
(Atch 16, 17, and 18) all show a basic tendency to overforecast. This is seen
most clearly in the large, positive overall biases and the departures from the
model distributions. The reliability results also show this. The erratic
reliability plot for category A is the result of event rarity (less than 1%
frequency) and sample size problems in the higher probability intervals. In
particular, the forecasters at the individual units did not have enough cases
to adequately identify their overforecasting problems with category A. Using the
model distributions as guidance, only 14 out of 1000 forecasts should
be for probabilities greater than 5% for a 3-hour forecast and 30 out of 1000
for a 6-hour forecast. By contrast the units placed 53 and 61 forecasts
per 1000 for 3- and 6-hours respectively at probabilities above 5%, approximately
4 and 2 times the model amounts. The category B results (Atch 17) are quite
good. The erratic reliability at 6 hours again reflects sample size effects
rather than true reliability problems. The distribution plots for both A and B
indicate a forecaster preference for 60, 80, and greater than 95% probabilities
vice 50, 70, and 90% values. The reliability pattern at high probability values
for category C ceiling (Atch 18) is rather puzzling. It appears to be the
result of forecaster overuse of 5 to 85% probability values; i.e., forecasting
with less sharpness than skill would dictate, as well as an overforecasting
problem. It may also be that with four ceiling categories insufficient attention
is given to the assessment of the probabilities of each of the three rarer low
ceiling categories after the assignment of a probability for category D. The
overforecasting and strong positive bias for categories A, B, and C are direct
results of underforecasting and negative bias for category D.

g. Attachment 19 summarizes a comparison of the original 23 field units
and the 7 field units added on 1 Dec 77. The 23 units which received the seminar
scored better than did the 7 units which did not, regardless of the period of
comparison.

h. Attachments 20-23 show the 3-, 6-, 12- and 24-hour contingency tables
for persistence and the categorical forecasts which maximize AWS, Log, and
Gringorten skill scoring methods. Attachments 24-26 summarize these tables.
These data indicate that forecasts maximized for AWS and Log Skill Scores were
best and nearly equal for all hours and categories. Forecasts maximized for AWS
Skill Score had more correct hits but forecasts maximized for Log Skill Score
were less biased between optimistic and pessimistic forecasts. Additionally,
forecasts maximized for AWS Skill Score were better for Category A; forecasts
maximized for Log Skill Score were better for Category B. The Log Skill Score
(when compared to AWS Skill Score) does encourage the forecaster to make more
Category B forecasts. Forecasts to maximize the Gringorten Score show more
[text illegible in source] forecasts were usually pessimistic.


i. During the test, problems collecting and processing the forecast data
resulted in about a 25% data loss. These problems included incomplete
information recorded on the forms, misplaced forms, and data incorrectly
entered on punch cards. However, we believe the overall impact on
the test was negligible and should not bias the results.

7. Summary of Test Results:

a. Categorical forecasts made to maximize the Gringorten Score or Log
Score are not significantly better than categorical forecasts made to maximize
the AWS Skill Score.

b.  Forecasts  made  to  maximize  the  Gringorten  Score  were  more  pessimistic. 

c.  AWS  forecasters  can,  with  training  and  verification  feedback,  issue 
skillful  probability  forecasts. 

d.  AFGWC  12-  and  24-hour  probability  forecasts  almost  equal  TDL  MOS 
probability  forecasts. 

e.  The  AWS/DN  probability  forecasting  seminar  is  of  value  to  novice 
probability  forecasters. 


Participating  Units 


3WW: 

Det  9,  12WS  Tyndall  AFB,  FL  PAM 

*Det  4,  26WS  Loring  AFB,  ME  LIZ 

*Det  6,  26WS  Pease  AFB,  NH  PSM 

*Det  8,  26WS  Griffiss  AFB,  NY  RME 

*Det  12,  26WS  Plattsburg  AFB,  NY  PBG 

Det  14,  26WS  Blytheville  AFB,  AR  BYH 

Det  18,  26WS  Rickenbacker  AFB,  OH  LCK 

*Det  19,  26WS  Whiteman  AFB,  MO  SZL 

Det  20,  26WS  Barksdale  AFB,  LA  BAD 

*Det  22,  26WS  Carswell  AFB,  TX  FWH 

*Det  23,  26WS  McConnell  AFB,  KS  IAB 

Det  24,  26WS  K.  I.  Sawyer  AFB,  MI  SAW 

Det  26,  26WS  Grissom  AFB,  IN  GUS 

Det  28,  26WS  Wurtsmith  AFB,  MI  OSC 

5WW: 

Det  5,  3WS  England  AFB,  LA  AEX 

Det  12,  3WS  Selfridge  ANGB,  MI  MTC 

Det  31,  3WS  Dobbins  AFB,  GA  MGE 

Det  75,  3WS  Hurlburt  AFB,  FL  HRT 

Det  1,  5WS  Ft  Campbell,  KY  HOP 

Det  5,  5WS  Ft  Knox,  KY  FTK 

Det  10,  5WS  Ft  Benning,  GA  LSF 

**Det  31,  5WS  Ft  Polk,  LA  POE 

Det  2,  24WS  Columbus  AFB,  MS  CBM 

Det  9,  24WS  Maxwell  AFB,  AL  MXF 

Det  22,  24WS  Keesler  AFB,  MS  BIX 

7WW: 

Det  9,  7WW  Scott  AFB,  IL  BLV 

Det  20,  7WW  Little  Rock  AFB,  AR  LRF 

Det  13,  15WS  Robins  AFB,  GA  WRB 

Det  15,  15WS  Wright-Patterson  AFB,  OH  FFO 

AFGWC: 

Det  10,  2WS  Eglin  AFB,  FL  VPS 

*Added  on  1  Dec  77. 

**No  12-  or  24-hour  forecasts  were  made  for  Ft  Polk. 
SKILL  SCORES 


The AWS Skill Score = (Unit percent correct - persistence percent correct)/(100% -
persistence percent correct). This score weights all correct forecasts equally;
a hit from predicting the difficult to forecast bad weather categories (A and B)
is worth the same as a correct prediction of easier to forecast good weather.
This score can range from -∞ to a maximum possible of +1. A negative score
indicates the absence of skill. The greater the number above zero, the greater
the skill.

The Gringorten Skill Score (GSS)¹ gives greater weight to correct forecasts of the
harder to predict bad weather categories. The weight for each category is
inversely proportional to the climatological frequency of occurrence of the
category. A correct forecast of a weather category which occurs 2% of the time
would be given a weight of 50 (1/0.02), whereas a correct forecast of a
category which occurs 80% of the time would be given a weight of 1.25 (1/0.8).
The GSS is calculated as follows:

$$GSS = \frac{\left(\sum_{i=1}^{G} H_i W_i\right) - N}{\left(\sum_{i=1}^{G} O_i W_i\right) - N}$$

Where N is the number of forecasts, Hi is the number of forecast hits in category
i, Oi is the number of observations in category i, Wi is the weighting factor for
category i (1/climatological frequency of category i), and G is the greater
of: (a) the number of categories in which at least one observation occurred, or
(b) the number of categories for which at least one forecast was issued. This
score can also range from -∞ to +1, where +1 is perfect forecasting. For the
test, the weighting factors were calculated using the observed frequencies
of occurrence (Wi = N/Oi) rather than the climatological frequencies. With this
change, the Gringorten Skill Score becomes

$$GSS = \frac{\left(\sum_{i=1}^{G} \frac{H_i}{O_i}\right) - 1}{G - 1}$$

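The Gringorten Skill Score and its simplified test form can be compared numerically. This is a minimal sketch with hypothetical hit and observation counts. It assumes every category was observed at least once, so that G equals the number of categories and the weights N/Oi are defined.

```python
def gringorten_score(hits, observed, weights, n_forecasts):
    """General form: (sum H_i W_i - N) / (sum O_i W_i - N)."""
    numerator = sum(h * w for h, w in zip(hits, weights)) - n_forecasts
    denominator = sum(o * w for o, w in zip(observed, weights)) - n_forecasts
    return numerator / denominator


def gringorten_score_test_form(hits, observed):
    """Test form with W_i = N / O_i: (sum H_i / O_i - 1) / (G - 1).
    Assumes every category has at least one observation."""
    g = len(observed)
    return (sum(h / o for h, o in zip(hits, observed)) - 1.0) / (g - 1)


# Hypothetical counts for categories A, B, C, D out of N = 1000 forecasts.
hits = [5, 20, 30, 800]
observed = [10, 40, 50, 900]
n = sum(observed)
weights = [n / o for o in observed]  # observed-frequency weights, as in the test

general = gringorten_score(hits, observed, weights, n)
simplified = gringorten_score_test_form(hits, observed)
print(round(general, 4), round(simplified, 4))
```

The two forms agree because, with Wi = N/Oi, the denominator sum reduces to G times N and the factor N cancels.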
The Log Skill Score² is a penalty score; i.e., correct forecasts are given a
weight of zero and missed forecasts are given "penalty points." The Log Score
takes the "closeness" of incorrect forecasts into account by giving relatively
few penalty points to one category busts compared to the maximum penalties
assessed for three category busts. The penalty matrix for ceiling forecasts is:

                        FORECAST
                   A      B      C      D
              A    0     2[?]   58     81
   OBSERVED   B   35      0     15     39
              C   63     16      0     10
              D   89     38     16      0

A similar penalty matrix is used for visibility forecasts. The Log Score
is computed by multiplying the elements of the verification matrix by the
corresponding elements of the penalty matrix and summing the products, i.e.,


ATCH  3 


$$LS = \frac{1}{N} \sum_{i=1}^{4} \sum_{j=1}^{4} N_{ij} M_{ij}$$


Where Nij are the elements of the verification matrix and Mij are the elements
of the penalty matrix. The lower the score, the greater the forecast skill.
A perfect score is zero and the maximum score (for ceiling forecasts) is 89
(forecast category A every time and observe only category D).
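
The Log Score calculation can be sketched as below. The row and column orientation (rows observed, columns forecast) is inferred from the statement that forecasting A while observing only D gives the maximum of 89. One penalty value is not legible in the source, so it is stored as NaN here rather than guessed, and the function skips empty verification cells so that the unknown value only matters if that cell is actually populated. The verification counts are hypothetical.

```python
NAN = float("nan")

# Ceiling penalty matrix. Rows: observed A-D; columns: forecast A-D.
# The (observed A, forecast B) entry is illegible in the source, hence NaN.
CEILING_PENALTY = [
    [0, NAN, 58, 81],
    [35, 0, 15, 39],
    [63, 16, 0, 10],
    [89, 38, 16, 0],
]


def log_score(verification, penalty):
    """LS = (1/N) * sum over i, j of N_ij * M_ij; lower is better.
    verification[i][j] counts cases with category i observed and j forecast."""
    n = sum(sum(row) for row in verification)
    total = 0.0
    for i, row in enumerate(verification):
        for j, count in enumerate(row):
            if count:  # skip empty cells so unused penalties are never touched
                total += count * penalty[i][j]
    return total / n


# Worst case described in the text: always forecast A, always observe D.
worst = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [50, 0, 0, 0]]
# Perfect forecasts: every case on the diagonal.
perfect = [[2, 0, 0, 0], [0, 5, 0, 0], [0, 0, 8, 0], [0, 0, 0, 85]]
# Hypothetical mixed sample that avoids the illegible cell.
mixed = [[1, 0, 1, 0], [0, 4, 1, 0], [0, 2, 6, 2], [0, 0, 3, 80]]

print(log_score(worst, CEILING_PENALTY))
print(log_score(perfect, CEILING_PENALTY))
print(round(log_score(mixed, CEILING_PENALTY), 2))
```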

1.  Gringorten,  I.  I.,  1967  Journal  of  Applied  Meteorology,  6,  pp  742-747. 

2.  MacDonald,  A.E.,  1977,  Western  Region  Technical  Attachment  No.  77-18. 


Combining ceiling-visibility probabilities: The probabilities for combined
ceiling-visibility categories were calculated by AFGWC using the relation
proposed by Capt Al Boehm:

$$P_{cv} = (1 - \rho) P_c P_v + \rho \, \mathrm{MIN}(P_c, P_v)$$

where Pcv is the combined probability for the ceiling-visibility category,
Pv is the assigned probability for visibility in the category, Pc is the
assigned probability for ceiling in the category, and ρ is the correlation
between ceiling and visibility. 0.3 was used for the value of the correlation.
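
The combination rule can be illustrated with a short sketch. The function and parameter names are illustrative; `rho` stands for the ceiling-visibility correlation, with the 0.3 default taken from the text, and the input probabilities are hypothetical.

```python
def combined_probability(pc, pv, rho=0.3):
    """Combined ceiling-visibility probability:
    (1 - rho) * Pc * Pv + rho * min(Pc, Pv)."""
    return (1.0 - rho) * pc * pv + rho * min(pc, pv)


# Hypothetical category probabilities for ceiling and visibility.
p_cv = combined_probability(0.8, 0.9)
print(round(p_cv, 3))

# Limiting cases: rho = 0 gives the independent product, rho = 1 gives the minimum.
independent = combined_probability(0.8, 0.9, rho=0.0)
fully_correlated = combined_probability(0.8, 0.9, rho=1.0)
print(round(independent, 3), round(fully_correlated, 3))
```

The relation interpolates between treating ceiling and visibility as independent events and treating them as perfectly correlated, so the combined probability always lies between the product and the smaller of the two inputs.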
SKILL SCORE COMPARISON

[Figure: PERFORMANCE TREND BY FORECAST LENGTH (HOURS), COMBINED CIG/VSBY]

[Figure: PERCENT IMPROVEMENT, STATION BRIER SCORE OVER CC BRIER SCORE; includes AVERAGE LAST 3 MONTHS]

[Figure: Comparison of Subjective (AFGWC) and Objective (MOS) Probability Forecasts; Percent Improvement, Forecast Brier Score over Conditional Climatology Brier Score]

[Figures: reliability and distribution plots; axes OBSERVED FREQUENCY and FORECAST FREQUENCY; legible panel titles: DETACHMENT CATEGORY D VISIBILITY FEB-MAR, DETACHMENT CATEGORY A CEILING FEB-MAR]


Comparison of Original 23 Units and 7 Units Added in December

Percent Improvement, Forecaster Brier Score over Conditional Climatology Brier Score

                              3 HOUR           6 HOUR
                            Cig    Vsby      Cig    Vsby
Original 23 (Oct-Mar)        36     13        24      6
Original 23 (Dec-Mar)        35     12        24     10
Added 7 (Dec-Mar)            27      5        22      0
Original 23 (Feb-Mar)        34     17        25      9
Added 7 (Feb-Mar)            30      9        23      5
[Tables: CONTINGENCY TABLES FOR CIG/VIS AT 3 HR, 6 HR, 12 HR, AND 24 HR (forecast versus OBSERVED category); legible panel titles: Persistence, Maximizes AWS Skill Score. Table entries are not recoverable from the source.]
Overall Percent of Correct, Optimistic, and Pessimistic Forecasts*

3 HOUR                          Pers     AWS     Log    Gring
% of Correct Forecasts          80.5    88.5    88.0    76.6
% of Optimistic Forecasts        8.3     5.1     5.6     2.7
% of Pessimistic Forecasts      11.2     6.4     6.4    20.7

6 HOUR                          Pers     AWS     Log    Gring
% of Correct Forecasts          74.6    83.4    83.1    70.9
% of Optimistic Forecasts       11.6     8.5     8.8     4.7
% of Pessimistic Forecasts      13.8     8.1     8.1    24.4

12 HOUR                         Pers     AWS     Log    Gring
% of Correct Forecasts          69.2    75.2    74.0    52.1
% of Optimistic Forecasts       15.0    14.1    13.8     6.3
% of Pessimistic Forecasts      15.8    10.7    12.2    41.6

24 HOUR                         Pers     AWS     Log    Gring
% of Correct Forecasts          66.0    72.6    71.7
% of Optimistic Forecasts       17.2    17.9    17.1
% of Pessimistic Forecasts      17.8     9.5    11.2

*Made for 22 stations for CIG/VSBY combined.


ATCH 24


Number of Hits and Busts*

3 HOUR                 Pers     AWS     Log    Gring
# of Hits              9392   10296   10241     8911
# of 1 Cat Busts       1902    1119    1174     2255
# of 2 Cat Busts        321     189     192      344
# of 3 Cat Busts         52      29      25      119

6 HOUR                 Pers     AWS     Log    Gring
# of Hits              8696    9684    9641     8233
# of 1 Cat Busts       2274    1546    1602     2695
# of 2 Cat Busts        578     331     330      526
# of 3 Cat Busts        110      44      33      152

12 HOUR                Pers     AWS     Log    Gring
# of Hits              8110    8737    8620     6069
# of 1 Cat Busts       2370    2048    2234     3352
# of 2 Cat Busts       1025     730     713     1269
# of 3 Cat Busts        212      99      76      953

24 HOUR                Pers     AWS     Log    Gring
# of Hits              7616    8455    8350     5723
# of 1 Cat Busts       2604    2129    2294     3434
# of 2 Cat Busts       1274     928     888     1563
# of 3 Cat Busts        217     138     108      924

*For 22 stations for CIG/VSBY combined.


ATCH 25


Ratio of Forecasts* to Observations Made by Category

3 HOUR           Pers      AWS      Log    Gring
Category A      1.270    1.025     .820    4.245
Category B      1.071    1.009    1.118     .889
Category C      1.031    1.057     .941    1.511
Category D       .977     .987     .999     .837

6 HOUR           Pers      AWS      Log    Gring
Category A      1.067     .732     .556    3.937
Category B      1.072     .978    1.116     .877
Category C      1.040    1.054     .914    1.628
Category D       .979    1.000    1.013     .809

12 HOUR          Pers      AWS      Log    Gring
Category A      1.030     .303     .100    7.973
Category B      1.002    1.021    1.193    1.105
Category C      1.042     .871     .933    1.746
Category D       .991    1.040    1.011     .636

24 HOUR          Pers      AWS      Log    Gring
Category A      1.027     .169     .031    7.396
Category B       .943     .799     .884    1.232
Category C      1.036     .730     .923    1.668
Category D      1.001    1.107    1.062     .642

*Made for 22 stations for CIG/VSBY combined.


ATCH 26


